Chapter 5

Question 1

Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

Data import and wrangling

library(tibble)
library(readr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(tidyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
human_new <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# no NA
any(is.na(human_new))
## [1] FALSE
# analyze and summary of variables
str(human_new)
## spc_tbl_ [155 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country  : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ Edu2.FM  : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : num [1:155] 64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num [1:155] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Edu2.FM = col_double(),
##   ..   Labo.FM = col_double(),
##   ..   Life.Exp = col_double(),
##   ..   Edu.Exp = col_double(),
##   ..   GNI = col_double(),
##   ..   Mat.Mor = col_double(),
##   ..   Ado.Birth = col_double(),
##   ..   Parli.F = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(human_new)
##    Country             Edu2.FM          Labo.FM          Life.Exp    
##  Length:155         Min.   :0.1717   Min.   :0.1857   Min.   :49.00  
##  Class :character   1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30  
##  Mode  :character   Median :0.9375   Median :0.7535   Median :74.20  
##                     Mean   :0.8529   Mean   :0.7074   Mean   :71.65  
##                     3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25  
##                     Max.   :1.4967   Max.   :1.0380   Max.   :83.50  
##     Edu.Exp           GNI            Mat.Mor         Ado.Birth     
##  Min.   : 5.40   Min.   :   581   Min.   :   1.0   Min.   :  0.60  
##  1st Qu.:11.25   1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65  
##  Median :13.50   Median : 12040   Median :  49.0   Median : 33.60  
##  Mean   :13.18   Mean   : 17628   Mean   : 149.1   Mean   : 47.16  
##  3rd Qu.:15.20   3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95  
##  Max.   :20.20   Max.   :123124   Max.   :1100.0   Max.   :204.80  
##     Parli.F     
##  Min.   : 0.00  
##  1st Qu.:12.40  
##  Median :19.30  
##  Mean   :20.91  
##  3rd Qu.:27.95  
##  Max.   :57.50
# Country is as rownames
human_new <- column_to_rownames(human_new, "Country")

Graphical overview

# histogram per variable
pivot_longer(human_new, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# charts and correlations
ggpairs(human_new,lower=list(combo=wrap("facethist",binwidth=0.1)))

# correlation between variables
c <- cor(human_new)
corrplot(c, method="circle")

Commenting distributions and relationships between them

The distribution of these variables appears to deviate from a normal distribution.

HDI, Life expectancy, education exp, education mean is negatively correlated with GII, Mat mortality, adolescent birth. GII and maternity mortality are also negatively correlated with education for female and male.

HDI, Life expectancy, education exp, education mean is positively correlated with education of male and female. GII, adolescent birth and maternity mortality are also positively correlated.

> This observed correlations appears coherent, highlighting logical relationships between the various variables.

Question 2

Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

Non-standardized PCA

# PCA
pca_human <- prcomp(human_new)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

# GNI has too big variances and is dominating the pca. The magnitude is too different between the metrics

Question 3 - standardized PCA

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

Scaled data & standardized PCA

# standardization of human_new
scaled_human <- scale(human_new)

# new PCA with scaled data
pca_human2 <- prcomp(scaled_human)

# draw a biplot of the PCA, adding captions that describe the phenomena
biplot(pca_human2, choices = 1:2,
   xlab="PC1 - Country Socio-Economic Development", ylab="PC2 - Female Empowerment and Socio-Economic Development by Women", main = " Principal Component Analysis of Socio-Economic Indicators: Mapping Country Development Factors and Female Empowerment")

# Check for missing values
any(is.na(scaled_human))
## [1] FALSE
# Check for infinite values
any(is.infinite(scaled_human))
## [1] FALSE

Include captions and descriptions on PCA:

PC1 - Country Socio-Economic Development. This component shows multiple socio-economic factors contributing to overall country development. It includes on one side, variables like Gross National Income, Human Development Index, Life Expectancy, Education, and on the other side, variables like Adolescent birth rate, Maternal mortality ratio. This axis reflects a broad aspect of a country’s overall progress and development.

PC2 - Female Empowerment and Socio-Economic Development by Women. This component specifically highlights aspects related to female empowerment, participation, and their impact on socio-economic development. It might incorporate variables such as Female Education, Female Labor Force Participation.

Question 4 - interpret the PCA

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points):

In the first PCA, the GNI variable has too high metrics and too big variances vs the other variables. This variable dominates the PCA and therefore the PCA cannot be analyzed properly. The magnitude is too different between the metrics to compare them together without scaling them. After scaling the dataset, I did the PCA again to compare them more equally.

Analyze the standardized PCA: The arrows aligned with PC1 and its associated variables provide insight into the relationships between the variables themselves, it is the same for the PC2. The direction of the arrow suggests that the pointed variables have a significant and positive influence on the principal components represented in the plot. The direction of the arrow indicates which way the variable is correlated with the principal component. The length of the arrow represents the strength of the variable’s contribution to that principal component. Longer arrows indicate a stronger influence on that component.

On the left side of the chart, countries appear more developed, characterized by higher values in key development indicators such as Gross National Income, Life Expectancy at Birth, Human Development Index, Life Expectancy in Education, and Female Population with Secondary Education. These variables has significant influence, suggesting that the variables impact positively each other. As these metrics increase, the level of development tends to be higher. We can find in this area many European/Western countries. Some of them also have the characteristics of higher labor rate among females and a closer rate when compared to the labor rate for males (e.g Iceland, Norway, Denmark). Conversely, countries like Qatar, Bahrain, UAE, and Saudi Arabia display a relatively good development level but possess lower female labor participation, impacting the overall women development and freedom.

On the opposite side of the chart, variables such as Adolescent Birth Rate, Maternal Mortality Ratio, Male Labor Force Participation Rate, HDI Rank, and GNI Rank (where higher ranks denote lower HDI or GNI) portray countries facing more challenging situations. Higher values in these variables correlate with less favorable conditions in a country. As these indicators rise, the situation tends to be less favorable. These countries, primarily from the Global South, such as Burkina Faso, Haiti, Niger, and Sierra Leone, face lower overall development. Women situations also vary based on their positioning along PC2, reflecting the diverse and inequal experiences of women in different countries.

Question 5

Tea data import and exploration

The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions).

In addition, some personal details were asked (4 questions). Load the tea dataset and convert its character variables to factors: tea <- read.csv(“https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv”, stringsAsFactors = TRUE)

Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.

tea_new <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# View and summary of the dataset
view(tea_new)
str(tea_new)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_new)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
# variables are factorial, except few ones. let's map the different categories thanks to the MCA

# But first let's keep only the categorical variables. An alternative would have been to change the non-catgeorical variables to categorical ones.
tea_new_2 <- tea_new %>% dplyr::select(-breakfast,-age)

Multiple Correspondence Analysis (MCA)

On the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

library(FactoMineR)

# MCA
mca <- MCA(tea_new_2, graph = FALSE)
mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 34 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$var$eta2"       "coord. of variables"             
## 8  "$ind"            "results for the individuals"     
## 9  "$ind$coord"      "coord. for the individuals"      
## 10 "$ind$cos2"       "cos2 for the individuals"        
## 11 "$ind$contrib"    "contributions of the individuals"
## 12 "$call"           "intermediate results"            
## 13 "$call$marge.col" "weights of columns"              
## 14 "$call$marge.li"  "weights of rows"
# summary of the MCA
summary(mca)
## 
## Call:
## MCA(X = tea_new_2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.092   0.084   0.072   0.060   0.057   0.055   0.050
## % of var.              5.913   5.392   4.629   3.881   3.672   3.502   3.188
## Cumulative % of var.   5.913  11.305  15.934  19.815  23.488  26.989  30.177
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.049   0.048   0.045   0.042   0.042   0.039   0.038
## % of var.              3.138   3.050   2.885   2.675   2.672   2.491   2.424
## Cumulative % of var.  33.315  36.365  39.251  41.926  44.598  47.089  49.513
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.036   0.036   0.034   0.032   0.032   0.031   0.030
## % of var.              2.320   2.305   2.173   2.077   2.049   2.016   1.944
## Cumulative % of var.  51.833  54.138  56.311  58.387  60.436  62.453  64.397
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.029   0.027   0.026   0.026   0.025   0.024   0.024
## % of var.              1.880   1.761   1.692   1.669   1.624   1.553   1.544
## Cumulative % of var.  66.276  68.038  69.730  71.399  73.023  74.577  76.120
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.023   0.022   0.022   0.021   0.020   0.019   0.019
## % of var.              1.456   1.426   1.404   1.339   1.264   1.246   1.221
## Cumulative % of var.  77.576  79.002  80.406  81.745  83.009  84.255  85.476
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.017   0.017   0.017   0.016   0.016   0.015   0.015
## % of var.              1.120   1.112   1.078   1.028   1.005   0.968   0.939
## Cumulative % of var.  86.596  87.708  88.786  89.814  90.819  91.786  92.725
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               0.013   0.013   0.012   0.012   0.011   0.010   0.010
## % of var.              0.866   0.837   0.802   0.739   0.697   0.674   0.666
## Cumulative % of var.  93.590  94.427  95.229  95.968  96.666  97.339  98.005
##                       Dim.50  Dim.51  Dim.52  Dim.53
## Variance               0.009   0.008   0.007   0.006
## % of var.              0.603   0.531   0.458   0.403
## Cumulative % of var.  98.608  99.139  99.597 100.000
## 
## Individuals (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1            | -0.629  1.432  0.202 |  0.148  0.086  0.011 |  0.109  0.055
## 2            | -0.427  0.660  0.139 |  0.288  0.330  0.064 | -0.110  0.056
## 3            |  0.097  0.034  0.006 | -0.156  0.096  0.015 |  0.125  0.073
## 4            | -0.560  1.133  0.227 | -0.279  0.308  0.056 | -0.026  0.003
## 5            | -0.173  0.108  0.028 | -0.149  0.088  0.020 |  0.027  0.003
## 6            | -0.681  1.675  0.272 | -0.293  0.340  0.050 | -0.008  0.000
## 7            | -0.223  0.181  0.037 |  0.015  0.001  0.000 |  0.183  0.155
## 8            | -0.031  0.003  0.001 |  0.111  0.048  0.009 | -0.097  0.043
## 9            | -0.063  0.014  0.003 |  0.266  0.281  0.048 |  0.388  0.697
## 10           |  0.177  0.114  0.021 |  0.369  0.539  0.090 |  0.315  0.459
##                cos2  
## 1             0.006 |
## 2             0.009 |
## 3             0.009 |
## 4             0.000 |
## 5             0.001 |
## 6             0.000 |
## 7             0.025 |
## 8             0.007 |
## 9             0.103 |
## 10            0.066 |
## 
## Categories (the 10 first)
##                 Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## Not.tea time | -0.552  4.238  0.236 -8.396 |  0.001  0.000  0.000  0.019 |
## tea time     |  0.428  3.285  0.236  8.396 | -0.001  0.000  0.000 -0.019 |
## evening      |  0.296  0.961  0.046  3.704 | -0.404  1.961  0.085 -5.051 |
## Not.evening  | -0.155  0.503  0.046 -3.704 |  0.211  1.025  0.085  5.051 |
## lunch        |  0.590  1.630  0.060  4.231 | -0.406  0.844  0.028 -2.907 |
## Not.lunch    | -0.101  0.280  0.060 -4.231 |  0.070  0.145  0.028  2.907 |
## dinner       | -1.047  2.447  0.082 -4.965 | -0.080  0.016  0.000 -0.379 |
## Not.dinner   |  0.079  0.184  0.082  4.965 |  0.006  0.001  0.000  0.379 |
## always       |  0.342  1.281  0.061  4.276 | -0.255  0.784  0.034 -3.193 |
## Not.always   | -0.179  0.670  0.061 -4.276 |  0.134  0.410  0.034  3.193 |
##               Dim.3    ctr   cos2 v.test  
## Not.tea time  0.063  0.071  0.003  0.959 |
## tea time     -0.049  0.055  0.003 -0.959 |
## evening       0.324  1.469  0.055  4.052 |
## Not.evening  -0.169  0.768  0.055 -4.052 |
## lunch         0.254  0.386  0.011  1.822 |
## Not.lunch    -0.044  0.066  0.011 -1.822 |
## dinner        0.744  1.581  0.042  3.531 |
## Not.dinner   -0.056  0.119  0.042 -3.531 |
## always        0.095  0.126  0.005  1.186 |
## Not.always   -0.050  0.066  0.005 -1.186 |
## 
## Categorical variables (eta2)
##                Dim.1 Dim.2 Dim.3  
## tea.time     | 0.236 0.000 0.003 |
## evening      | 0.046 0.085 0.055 |
## lunch        | 0.060 0.028 0.011 |
## dinner       | 0.082 0.000 0.042 |
## always       | 0.061 0.034 0.005 |
## home         | 0.013 0.002 0.027 |
## work         | 0.071 0.020 0.027 |
## tearoom      | 0.326 0.021 0.027 |
## friends      | 0.201 0.058 0.024 |
## resto        | 0.155 0.007 0.030 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",col.var = rainbow(ncol(tea_new_2)))

# It allows us to visualize many different categories from many catgeorical variables, all at once to see the categories that fit together and suggest association (similarities in behavior). For instance, unpackages and tea shop seem close to each other and separated from the rest of the categories. It suggests that they share some commonality / similar patterns or behavior in term of users preferences. Besides, they tend to occur together within the dataset.

# Select smaller dataset to do the MCA analyze with less variables
# let's now choose only few variables that could have correlations together but also show distinct behavior between users. 
tea_new_2 %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ name, scales = "free") +
  geom_bar() +
  labs(title = "Distribution of Categorical Variables") + 
  theme(axis.text.x = element_text(angle = 25, hjust = 1, size = 10)) +
  theme(strip.text = element_text(size = 10))

# smaller data set - I will look at 2 different ones. 
tea_smaller <- tea_new %>% dplyr::select(dinner, effect.on.health, frequency, evening, home, healthy, how, lunch, pub, price,Tea, where, work, sex)

tea_smaller2 <- tea_new %>% dplyr::select(diuretic, feminine, healthy, exciting, iron.absorption, escape.exoticism, relaxing, sophisticated, sex, slimming)

# MCA models
mca2 <- MCA(tea_smaller, graph = FALSE)
mca3 <- MCA(tea_smaller2, graph = FALSE)

# summary of the models
summary(mca2)
## 
## Call:
## MCA(X = tea_smaller, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.154   0.151   0.109   0.100   0.092   0.088   0.081
## % of var.              9.395   9.176   6.648   6.066   5.627   5.331   4.944
## Cumulative % of var.   9.395  18.572  25.220  31.286  36.914  42.245  47.189
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.081   0.075   0.071   0.068   0.064   0.063   0.061
## % of var.              4.911   4.595   4.306   4.129   3.887   3.820   3.690
## Cumulative % of var.  52.100  56.695  61.000  65.129  69.016  72.836  76.526
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.059   0.054   0.050   0.046   0.045   0.040   0.035
## % of var.              3.589   3.297   3.035   2.806   2.766   2.413   2.110
## Cumulative % of var.  80.114  83.411  86.446  89.252  92.018  94.431  96.542
##                       Dim.22  Dim.23
## Variance               0.033   0.024
## % of var.              2.020   1.438
## Cumulative % of var.  98.562 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   | -0.251  0.136  0.026 |  0.542  0.649  0.120 | -0.166
## 2                   | -0.235  0.120  0.070 |  0.139  0.042  0.024 | -0.231
## 3                   | -0.181  0.070  0.019 | -0.077  0.013  0.003 | -0.043
## 4                   | -0.158  0.054  0.015 |  0.447  0.442  0.124 | -0.200
## 5                   | -0.107  0.025  0.013 |  0.020  0.001  0.000 |  0.383
## 6                   | -0.266  0.153  0.029 |  0.673  1.001  0.185 | -0.216
## 7                   | -0.175  0.066  0.029 |  0.098  0.021  0.009 |  0.148
## 8                   | -0.125  0.034  0.014 |  0.139  0.042  0.017 |  0.213
## 9                   |  0.447  0.431  0.163 | -0.239  0.126  0.046 |  0.076
## 10                  |  0.559  0.675  0.229 | -0.326  0.234  0.077 | -0.156
##                        ctr   cos2  
## 1                    0.085  0.011 |
## 2                    0.164  0.068 |
## 3                    0.006  0.001 |
## 4                    0.122  0.025 |
## 5                    0.447  0.173 |
## 6                    0.142  0.019 |
## 7                    0.067  0.021 |
## 8                    0.139  0.039 |
## 9                    0.018  0.005 |
## 10                   0.075  0.018 |
## 
## Categories (the 10 first)
##                        Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## dinner              |  0.590  1.129  0.026  2.801 |  1.055  3.690  0.084  5.004
## Not.dinner          | -0.044  0.085  0.026 -2.801 | -0.079  0.278  0.084 -5.004
## effect on health    |  0.315  1.012  0.028  2.896 |  0.415  1.793  0.049  3.808
## No.effect on health | -0.089  0.286  0.028 -2.896 | -0.117  0.506  0.049 -3.808
## +2/day              |  0.138  0.373  0.014  2.044 | -0.578  6.698  0.245 -8.562
## 1 to 2/week         |  0.096  0.062  0.002  0.687 |  0.889  5.494  0.136  6.374
## 1/day               | -0.311  1.416  0.045 -3.659 |  0.467  3.277  0.101  5.502
## 3 to 6/week         |  0.229  0.275  0.007  1.416 | -0.298  0.477  0.011 -1.842
## evening             |  0.132  0.276  0.009  1.648 | -0.277  1.246  0.040 -3.460
## Not.evening         | -0.069  0.144  0.009 -1.648 |  0.145  0.651  0.040  3.460
##                        Dim.3    ctr   cos2 v.test  
## dinner              | -0.266  0.323  0.005 -1.260 |
## Not.dinner          |  0.020  0.024  0.005  1.260 |
## effect on health    |  0.787  8.908  0.175  7.226 |
## No.effect on health | -0.222  2.513  0.175 -7.226 |
## +2/day              | -0.225  1.405  0.037 -3.338 |
## 1 to 2/week         |  0.948  8.623  0.155  6.797 |
## 1/day               | -0.451  4.212  0.094 -5.309 |
## 3 to 6/week         |  0.875  5.669  0.098  5.407 |
## evening             |  0.433  4.202  0.098  5.409 |
## Not.evening         | -0.226  2.197  0.098 -5.409 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## dinner              | 0.026 0.084 0.005 |
## effect.on.health    | 0.028 0.049 0.175 |
## frequency           | 0.046 0.337 0.304 |
## evening             | 0.009 0.040 0.098 |
## home                | 0.001 0.023 0.035 |
## healthy             | 0.015 0.016 0.489 |
## how                 | 0.587 0.297 0.002 |
## lunch               | 0.000 0.152 0.049 |
## pub                 | 0.000 0.167 0.008 |
## price               | 0.617 0.224 0.148 |
summary(mca3)
## 
## Call:
## MCA(X = tea_smaller2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.172   0.133   0.119   0.113   0.101   0.094   0.079
## % of var.             17.193  13.350  11.935  11.267  10.062   9.371   7.921
## Cumulative % of var.  17.193  30.543  42.478  53.744  63.806  73.177  81.098
##                        Dim.8   Dim.9  Dim.10
## Variance               0.070   0.064   0.055
## % of var.              7.014   6.369   5.519
## Cumulative % of var.  88.112  94.481 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   | -0.705  0.964  0.496 | -0.036  0.003  0.001 |  0.267
## 2                   | -0.061  0.007  0.004 |  0.510  0.649  0.266 |  0.119
## 3                   | -0.160  0.050  0.034 | -0.400  0.400  0.213 |  0.317
## 4                   | -0.333  0.214  0.157 | -0.289  0.208  0.118 |  0.218
## 5                   | -0.512  0.509  0.251 | -0.050  0.006  0.002 |  0.263
## 6                   | -0.666  0.860  0.494 | -0.410  0.420  0.187 |  0.415
## 7                   | -0.666  0.860  0.494 | -0.410  0.420  0.187 |  0.415
## 8                   |  0.165  0.053  0.041 | -0.600  0.899  0.542 | -0.272
## 9                   | -0.636  0.783  0.462 | -0.272  0.185  0.085 | -0.150
## 10                  | -0.431  0.361  0.224 | -0.253  0.159  0.077 |  0.606
##                        ctr   cos2  
## 1                    0.198  0.071 |
## 2                    0.039  0.014 |
## 3                    0.281  0.133 |
## 4                    0.133  0.067 |
## 5                    0.193  0.066 |
## 6                    0.481  0.192 |
## 7                    0.481  0.192 |
## 8                    0.207  0.112 |
## 9                    0.063  0.026 |
## 10                   1.025  0.441 |
## 
## Categories (the 10 first)
##                         Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## diuretic            |   0.409   5.643   0.231   8.311 |   0.241   2.533   0.081
## Not.diuretic        |  -0.565   7.792   0.231  -8.311 |  -0.333   3.498   0.081
## feminine            |   0.799  15.957   0.481  11.997 |  -0.147   0.694   0.016
## Not.feminine        |  -0.603  12.038   0.481 -11.997 |   0.111   0.524   0.016
## healthy             |   0.239   2.323   0.133   6.309 |  -0.120   0.757   0.034
## Not.healthy         |  -0.557   5.421   0.133  -6.309 |   0.280   1.766   0.034
## exciting            |   0.069   0.108   0.003   0.951 |   0.993  28.589   0.622
## No.exciting         |  -0.044   0.068   0.003  -0.951 |  -0.626  18.023   0.622
## iron absorption     |   0.767   3.536   0.068   4.503 |   0.130   0.130   0.002
## Not.iron absorption |  -0.088   0.407   0.068  -4.503 |  -0.015   0.015   0.002
##                      v.test     Dim.3     ctr    cos2  v.test  
## diuretic              4.906 |   0.277   3.731   0.106   5.630 |
## Not.diuretic         -4.906 |  -0.383   5.153   0.106  -5.630 |
## feminine             -2.205 |  -0.373   5.022   0.105  -5.607 |
## Not.feminine          2.205 |   0.282   3.789   0.105   5.607 |
## healthy              -3.173 |   0.369   7.977   0.317   9.741 |
## Not.healthy           3.173 |  -0.861  18.614   0.317  -9.741 |
## exciting             13.640 |  -0.133   0.575   0.011  -1.829 |
## No.exciting         -13.640 |   0.084   0.363   0.011   1.829 |
## iron absorption       0.762 |  -0.065   0.036   0.000  -0.379 |
## Not.iron absorption  -0.762 |   0.007   0.004   0.000   0.379 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## diuretic            | 0.231 0.081 0.106 |
## feminine            | 0.481 0.016 0.105 |
## healthy             | 0.133 0.034 0.317 |
## exciting            | 0.003 0.622 0.011 |
## iron.absorption     | 0.068 0.002 0.000 |
## escape.exoticism    | 0.053 0.029 0.000 |
## relaxing            | 0.006 0.438 0.062 |
## sophisticated       | 0.173 0.002 0.106 |
## sex                 | 0.304 0.070 0.240 |
## slimming            | 0.267 0.041 0.245 |
# visualize MCA 
plot(mca2, invisible=c("ind"), graph.type = "classic")

plot(mca3, invisible=c("ind"), graph.type = "classic")

Conclusion of MCAs

MCA 2: The categories “tea shop,” “upscale price,” and “unpackaged” appear closely related, indicating they collectively represent a particular type of behavior or consumer preference. Similarly, “tea bag” and “chain store” seem to indicate anothe pattern within the dataset.

MCA 3: Categories associated with tea benefits like “iron absorption,” “healthy,” “sophisticated,” and “diuretic” form a cluster, suggesting they share similar characteristics or appeal to a specific segment of consumers. Oppositely, tea without special benefits stands apart, indicating a different category or behavior. Moreover, the two categories “exciting” and “not relaxing” but also the two categories “feminine tea” and “female” show similarities, aligning with their expected conceptual correspondence in this context.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.